Useful tools for easy exploratory data analysis (EDA)
Off the shelf and simple functions for data analysis
- pandas_profiling
- sweetviz
- resumetable
- feature_transform (my library)
I will explore the transformations using the red wine quality dataset from Kaggle.
from pathlib import Path

import pandas as pd
from scipy import stats

from google.colab import drive

drive.mount('/content/gdrive', force_remount=True)

root_dir = Path('/content/gdrive/My Drive/')
csv_path = root_dir / 'redwine' / 'winequality-red.csv'
df = pd.read_csv(csv_path)
# https://gist.github.com/harperfu6/5ea565ee23aaf8461a840c480490cd9a
pd.set_option("display.max_rows", 1000)
def resumetable(df):
    print(f'Dataset Shape: {df.shape}')
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values
    # Shannon entropy (in bits) of each column's value distribution
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = \
            round(stats.entropy(df[name].value_counts(normalize=True), base=2), 2)
    return summary
Typically, the first thing to do is examine the first few rows of the data, but this only gives you a very rudimentary feel for it.
df.head()
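Beyond head(), describe() and info() are common next steps for a quick statistical and structural summary. A minimal sketch on a toy frame (the column names here merely stand in for the wine data):

```python
import pandas as pd

# Toy frame standing in for the wine data (column names illustrative).
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.8],
    "quality": [5, 5, 5],
})

print(df.head(2))     # first rows: a rudimentary feel for the values
print(df.describe())  # count/mean/std/min/quartiles/max per numeric column
df.info()             # dtypes, non-null counts, memory usage
```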
I found resumetable() to be very convenient. We get a sense of cardinality from Uniques, and we can easily see where data is missing.
Also, knowing the datatype of each column is helpful when it comes to pre-processing the data.
I came across this function on Kaggle (I think) and found it incredibly helpful.
resumetable(df)
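As an example of using those dtypes for pre-processing, pandas' select_dtypes splits columns into numeric ones (candidates for scaling or power transforms) and object ones (candidates for encoding). A small sketch on a toy frame (columns are illustrative, not from the wine data):

```python
import pandas as pd

# Toy frame with mixed dtypes (column names are hypothetical).
df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.0],
    "quality": [5, 6, 5],
    "label": ["red", "red", "red"],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
object_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols)  # columns to scale / transform
print(object_cols)   # columns to encode
```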
Another tool I use is pandas_profiling.
import sys
!"{sys.executable}" -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension
from ipywidgets import widgets
# Our package
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
profile = ProfileReport(df, title="red wine", html={"style": {"full_width": True}}, sort=None)
It takes a couple of minutes to process and display the results, even with a small dataset.
In return you get richer analysis, such as correlation plots and the distributions of the variables.
profile.to_widgets()
An alternative is Sweetviz. I tend to like this a bit better for its display of distributions. In general, it also loads a bit quicker.
!pip -q install sweetviz
import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_notebook(w=1200.)
I wanted a simple way to view the distributions of the features and, more importantly, a way to view the data after numerical transformations such as Box-Cox or a log transform.
The following plot is a sample of what I developed. The first panel is the input, and the subsequent panels show the various transformations along with their skew and kurtosis. The transform highlighted in pink is the one automatically selected as yielding the most Gaussian distribution.
See this
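A minimal sketch of that selection logic (my own simplification, not the exact feature_transform code): apply candidate transforms to a positive-valued feature, score each by absolute skew, and keep the one closest to zero.

```python
import numpy as np
from scipy import stats

def best_transform(x):
    """Score candidate transforms of a positive-valued array by |skew|;
    return (name, transformed array) for the most Gaussian-looking one."""
    x = np.asarray(x, dtype=float)
    candidates = {
        "identity": x,
        "log": np.log(x),
        "sqrt": np.sqrt(x),
        "box-cox": stats.boxcox(x)[0],  # boxcox returns (transformed, lambda)
    }
    scores = {name: abs(stats.skew(vals)) for name, vals in candidates.items()}
    best = min(scores, key=scores.get)
    return best, candidates[best]

# Usage: a right-skewed (lognormal) sample should prefer a log-like transform.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=2000)
name, transformed = best_transform(sample)
print(name, round(stats.skew(transformed), 3), round(stats.kurtosis(transformed), 3))
```

Note that Box-Cox requires strictly positive inputs; a real implementation would also need to handle zero or negative features (e.g. with a shift or the Yeo-Johnson transform).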
